A hybrid residue based sequential encoding mechanism with XGBoost improved ensemble model for identifying 5-hydroxymethylcytosine modifications

RNA modifications play an important role in cellular regulation mechanisms, which links them to gene expression and protein synthesis. Numerous RNA modifications are known, and together they offer a broad view of RNA's functions and character. Cytosine hydroxymethylation, produced through oxidation by TET enzymes, is one such crucial modification. It alters specific biochemical processes of the organism, such as gene expression and epigenetic regulation. Traditional laboratory methods for identifying 5-hydroxymethylcytosine (5hmC) samples are expensive and time-consuming. To address this challenge, this paper proposes XGB5hmC, a machine learning model based on the robust gradient boosting algorithm XGBoost, combined with different residue-based formulation methods to identify 5hmC samples. Six different frequency- and residue-based encoding features were fused to form a hybrid vector in order to enhance the model's discrimination capability. In addition, the proposed model incorporates SHAP (SHapley Additive exPlanations) based feature selection, which supports model interpretability by highlighting the most contributory features. Among the applied machine learning algorithms, the XGBoost ensemble model achieved better results than existing state-of-the-art models under the tenfold cross-validation test. Our model reported an accuracy of 89.97%, sensitivity of 87.78%, specificity of 94.45%, F1-score of 0.8934, and MCC of 0.8764. This study highlights the potential to provide valuable insights for enhancing medical assessment and treatment protocols, representing a significant advancement in RNA modification analysis.

RNA, a molecule at the center of cellular processes, is indispensable for synthesizing proteins and transmitting genetic information. Its presence in all living organisms underlines its importance [1]. Composed of a complex molecular assembly, RNA carries hereditary information and can act as a biological catalyst. Beyond its role inside cells, RNA also stores and transfers genomic information in certain viruses [2]. RNA modifications extend these roles: more than 100 RNA modifications have been identified so far, opening up new regulatory possibilities [3]. For example, mRNA carries N6-methyladenosine and N7-methylguanosine modifications, which act as regulatory tools in the different phases of the mRNA life cycle. Transfer RNA (tRNA) and ribosomal RNA (rRNA) carry post-transcriptional modifications such as 5mC and N1MeA, which contribute to their function [4,5]. The formation of 5hmC, a TET oxidation product, adds further complexity to the landscape of RNA modifications [6]. Appreciating the diverse functions of RNA is vital for establishing these roles, as they have far-reaching implications. mRNA, tRNA, and rRNA molecules are all subject to a complex interplay of molecular modifications that organize cellular processes, giving rise to cell-level changes in gene expression and protein synthesis and thus influencing the functioning of the whole organism [7,8]. A detailed picture of the fine-tuning exerted by RNA modifications will advance our knowledge of biological systems and may ultimately lead to novel therapeutic approaches and new discoveries in molecular biology.
The 5hmC modification was first discovered in wheat seeds, and subsequent findings across species and fields of inquiry have shown that it has remarkable scope [9]. Understanding the full influence of 5hmC variation supports the correct interpretation of genetic phenomena [10]. 5hmC is found in wheat seeds as well as in human and mouse tissues, where it is involved in RNA splicing, translation, and decay, and therefore affects gene expression. Thus, its role must be considered when studying the mechanisms of epigenetic regulation and eukaryotic gene expression [11]. Furthermore, 5hmC in humans is closely connected to diseases such as cancer, diabetes, and cardiovascular disease. Hence, analyzing 5hmC modifications is essential for studying human health and can contribute to patient-tailored treatment [12]. Biochemical and chemical methods such as LC-MS/MS, HPLC, and TLC have been intensively developed to identify 5hmC, and they detect the modification with high accuracy. Complementary PCR- and chromatography-based techniques are essential for resolving problems that emerge during 5hmC studies [13]. By combining different techniques, researchers are able to explore, interpret, and manipulate 5hmC dynamics in biological processes. However, although these methods are accurate, the procedures for detecting 5hmC are relatively complicated and protracted. The discovery of the 5hmC modification in wheat seeds therefore has implications for multiple species across various biological domains. Moreover, further research is required because of its diverse impacts on genetic processes and its association with human diseases; this would help in creating new strategies for prevention and intervention. In addition, although the complex experimental techniques used to locate 5hmC help to uncover hidden relationships, their limitations in terms of time and cost need to be considered.
A growing number of studies have applied machine learning to recognizing 5-hydroxymethylcytosine (5hmC) sites. The first machine learning study was presented by Liu et al. [14], who combined a traditional SVM algorithm with a sequence-based methodology. Their approach used then state-of-the-art techniques and substantially advanced 5hmC classification. Ahmed et al. [15] developed the iRNA5hmC-PS model, employing the PSG k-mer technique for feature extraction and Logistic Regression as the classifier. Although their contributions are evident, these models remain reliant on traditional learning approaches. The striking commonalities among such models make it difficult to forecast 5hmC sequences precisely, and identification therefore requires considerable human effort and computational resources to extract the necessary features. Recently, Ali et al. [16] constructed a new model called iRhmC5XGB5, which switches from the commonly used deep neural network solutions to the efficient XGBoost algorithm for recognizing 5hmC. This design serves as a proof of concept and highlights the role of XGBoost among the leading algorithms. One-hot-encoding-based features were used to train the XGBoost model, which achieved high performance. The iRhmC5XGB5 model thus offers a different approach and new capabilities compared with conventional neural network methodologies. Hence, XGBoost is considered an effective tool for training models that generalize well, and it also competes successfully with other methods in revealing the intricate connections between epigenetic marks and other key biological variables.
Building on recent empirical studies, this study introduces an intelligent and robust machine learning model, following Chou's 5-step rule, that extracts discriminative features and uses the XGBoost model to predict 5-hydroxymethylcytosine (5hmC) modifications. We refer to the proposed model as XGB5hmC. XGB5hmC uses seven feature extraction techniques that convert the RNA sequences into feature vectors. All the extracted feature vectors are then combined into a fused feature vector. Additionally, SHAP analysis is used to select the most contributory features and to remove inessential and redundant information. Finally, among several learning models, the XGBoost model evaluated with a tenfold cross-validation test is selected as the training model because of its high learning capability.
The main contributions of this paper are as follows:
• Firstly, this paper follows Chou's 5-step procedure, incorporating different structural and sequential feature encoding schemes to predict 5-hydroxymethylcytosine (5hmC) modifications.
• Secondly, a SHAP interpretation-based feature selection approach is employed to choose highly relevant and discriminative features from the extracted set.
• Thirdly, an improved XGBoost learning model is proposed to improve the prediction outcomes and robustness for 5hmC modifications, surpassing traditional machine learning techniques.
• Finally, the model trained on the optimal feature set is tested on an independent dataset to thoroughly evaluate overfitting and generalization.
The remainder of this paper is organized as follows: Section "Literature review" describes the related work. Section "Material and methods" describes the materials and methods of the proposed model in detail. Section "Performance evaluation matrix" presents the evaluation metrics, and Section "Experimental results and analysis" reports the experimental results, followed by the conclusion and future work.

Literature review
In recent years, there has been a rapid advancement in the field of bioinformatics related to epigenetics. This progress includes the application of machine learning techniques to identify and explain the structures of 5-hydroxymethylcytosine (5hmC) sites, which are well-known and significant epigenetic modifications that play a crucial role in the regulation of gene expression. The study of gene expression and its regulation by epigenetic mechanisms further demonstrates the intricate relationship between these processes in biological systems. This understanding aids in identifying diseases that warrant further investigation and determining the most effective treatment options.
Liu et al. [14] introduced a method that integrates an SVM algorithm with an innovative one-level feature extraction method. Their model achieved high accuracy in 5hmC localization, providing new insights into epigenetics. Similarly, Ahmed et al. [15] introduced iRNA5hmC-PS, a model combining PSG k-mer-based features with a logistic regression (LR) classifier. Despite these achievements, traditional learning models still face challenges in identifying and quantifying 5hmC sites because of difficulties in feature extraction and unreliable experimental design. To address these issues, Ali et al. [16] proposed a strategy called iRhmC5CNN, which divides the 5hmC sequence detection problem into several tasks and employs convolutional neural networks (CNNs) for each task. The CNN model, trained and tested using one-hot encoding, demonstrated that deep learning can significantly benefit epigenetic research.
Further improvements have been identified, suggesting that more promising results are achievable. However, despite significant progress, many challenges remain, and additional efforts are needed to develop deep learning algorithms for detecting and predicting 5hmC sites. These algorithms can elucidate the complex mechanisms of genomic regions involved in 5hmC modification. A simpler design and faster generalization ability are crucial considerations when implementing models in epigenetic studies. Another recent model, Deep5hmC [17], focuses on using deep neural networks and hybrid features for the accurate detection of 5hmC modifications. Recognizing the importance of RNA modification in gene regulation and epigenetic modification, the authors aimed to overcome the labor-intensive and costly challenges of previous RNA detection techniques. Their approach combines seven unique feature extraction methods with several machine learning procedures, including Random Forest, Naive Bayes, Decision Tree, and Support Vector Machine. The model achieved high accuracy, surpassing prior models by 7.59% under K-fold cross-validation. This advancement shows how Deep5hmC can improve the early detection of cancer and cardiovascular diseases, marking a significant step forward in RNA modification analysis. Combining multiple data sources and enhancing the ensemble learning setup can play a vital role in the accuracy and reliability of predictive models for 5hmC sites. Computational biologists, bioinformaticians, and experimental biologists need to advance machine learning algorithms based on biological experiments and verify their predictions through experimentation. The new machine learning technology transforming the epigenetic field has fostered a genuine interest in the complexity of epigenetic control and its impact on cell functions and human health. Researchers can leverage both computational approaches and experimental tools to effectively understand how 5hmC modification influences gene expression regulation, development, and disease progression.

Material and methods
The proposed RNA modification predictor involves several interrelated steps. It begins with extracting sequences from steady-state DNA (ssDNA) and applying feature extraction techniques. The extracted features are then analyzed and optimized using feature selection techniques. Lastly, predictive modeling is assessed with different performance parameters such as accuracy, sensitivity, specificity, Matthews correlation coefficient (MCC), and precision-recall scores. The complete framework of the proposed model, showing each phase, is provided in Fig. 1.

Benchmark dataset
In bioinformatics and machine learning, the acquisition or selection of a valid benchmark dataset is an essential step for developing an intelligent computational model. The selection of a suitable benchmark has a high impact on the performance of a computational model. According to Chou's comprehensive review [18,19], a valid and reliable benchmark dataset is always required for the design of a powerful and robust computational model. Hence, in this paper, we used the same benchmark datasets that were used in [16]. The selected datasets can be expressed in mathematical form using Eq. (1).
$$R_1 = R_1^{+} \cup R_1^{-} \qquad (1)$$

where R_1 represents the total set of samples, R_1^{+} represents the positive (5hmC) samples, and R_1^{-} represents the negative (non-5hmC) samples. Firstly, CD-HIT software was applied with a threshold value of 80% to eliminate highly similar sequences. Secondly, we employed a random sampling technique to select the same number of positive samples as negative samples in order to balance the benchmark dataset. Finally, we obtained a benchmark dataset containing a total of 1324 sequences, of which 662 are 5hmC sequences and 662 are non-5hmC sequences. To assess the generalization power of our proposed model, we employed an independent dataset. To create it, the original dataset is divided into two parts: 20% of the samples are randomly selected and reserved as the independent test set, while the remaining 80% are used for training. The independent dataset consists of 264 sequences, of which 132 are positive samples and 132 are negative samples.
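The balancing and hold-out procedure described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `balance_and_split`, the seed, and the per-class rounding rule are assumptions, and the CD-HIT filtering step is assumed to have been applied beforehand.

```python
import random


def balance_and_split(positives, negatives, test_fraction=0.2, seed=0):
    """Undersample to equal class sizes, then hold out a stratified test set.

    Hypothetical helper: the paper does not publish its sampling code.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))      # size of the smaller class
    pos = rng.sample(positives, n)               # random sampling without replacement
    neg = rng.sample(negatives, n)
    n_test = int(n * test_fraction)              # per-class size of the independent set
    test = [(s, 1) for s in pos[:n_test]] + [(s, 0) for s in neg[:n_test]]
    train = [(s, 1) for s in pos[n_test:]] + [(s, 0) for s in neg[n_test:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

With 662 sequences per class and a 20% hold-out, this rounding rule gives 132 test sequences per class (264 in total), which agrees with the sizes reported above.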

Feature extraction techniques
To generate prominent, reliable, and variant statistical-based discriminative descriptors, several feature encoding approaches have been utilized for the formulation of protein, RNA, and DNA sequences [20]. A detailed overview of the proposed feature encoding schemes is presented in the following sections.

Mismatches
The Mismatch descriptor [21] counts the occurrences of k-length runs of neighboring nucleic acids (k-mers) that differ by at most m mismatches (m < k). Instead of analyzing only exact occurrences of k-mers, the profile allows up to m mismatches. For the 3-length subsequence "AAC" and at most one mismatch, we need to consider three cases, "-AC", "A-C", and "AA-", where "-" can be replaced by any nucleic acid residue. This descriptor is governed by two parameters: k, the length of the neighboring nucleic acid run (the k-mer) considered in the analysis, and m, the number of mismatches allowed, which defines how flexible the matching process is [22]. The descriptor is formalized in Eq. (2), where c_{i,j} denotes the number of occurrences of the i-th k-mer type with j mismatches.
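The counting idea behind the mismatch profile can be sketched in a few lines of Python. This is an illustrative sketch under one common reading of the descriptor (each of the 4^k k-mer types is credited with every window that differs from it in at most m positions); the function name `mismatch_profile` and the RNA alphabet "ACGU" are assumptions, and the result is not guaranteed to match the exact normalization of Eq. (2).

```python
from itertools import product


def mismatch_profile(seq, k=3, m=1, alphabet="ACGU"):
    """Count, for every k-mer type, the windows of seq within m mismatches of it."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    profile = []
    for kmer in kmers:
        # Hamming distance between the k-mer type and each sliding window
        count = sum(
            1 for w in windows
            if sum(a != b for a, b in zip(kmer, w)) <= m
        )
        profile.append(count)
    return kmers, profile
```

For k = 3 and m = 1, every window matches itself plus the 9 k-mers that differ in exactly one position, so each window contributes to 10 entries of the 64-dimensional profile.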

Accumulated Nucleotide Frequency (ANF)
The Accumulated Nucleotide Frequency (ANF) encoding [23] encodes each nucleotide of the sequence while taking into account the nucleotide distribution within the RNA sequence [24]. The density d_i of the nucleotide s_i at position i in the RNA sequence is computed using Eq. (4).

$$d_i = \frac{1}{|S_i|}\sum_{j=1}^{l} f(s_j), \qquad f(s_j) = \begin{cases} 1 & \text{if } s_j = q \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

Here, l = |S_i| denotes the length of the i-th prefix string S_i = {s_1, s_2, ..., s_i} of the sequence, and q is the nucleotide at position i, an element of the set {A, C, G, U}. Consider the sequence "UCGUUCAUGG". The density of 'U' is 1 at position 1, 0.5 at position 4, 0.6 at position 5, and 0.5 at position 8. The density of 'C' is 0.5 at position 2 and 0.33 at position 6. The density of 'G' is 0.33 at position 3, 0.22 at position 9, and 0.3 at position 10. Finally, the density of 'A' at position 7 is 0.14.
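The ANF density can be computed with a single pass over the sequence. The sketch below is a minimal illustration that reproduces the worked example for "UCGUUCAUGG"; the function name `anf` is a placeholder, not part of any published package.

```python
def anf(seq):
    """Return the accumulated nucleotide frequency d_i for each position of seq."""
    counts = {}          # running count of each nucleotide seen so far
    densities = []
    for i, nt in enumerate(seq, start=1):
        counts[nt] = counts.get(nt, 0) + 1
        # occurrences of this nucleotide in the prefix s_1..s_i, divided by i
        densities.append(counts[nt] / i)
    return densities


print([round(d, 2) for d in anf("UCGUUCAUGG")])
# [1.0, 0.5, 0.33, 0.5, 0.6, 0.33, 0.14, 0.5, 0.22, 0.3]
```

Each sequence of length L yields an L-dimensional vector, so sequences must share a common length before the vectors are stacked into a feature matrix.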

Position-specific trinucleotide propensity based on single-strand (PSTNPSS)
PSTNPss, which stands for position-specific trinucleotide propensity based on single-strand, is a computational method developed to analyze the characteristics of single-stranded DNA or RNA. This approach, outlined in [25,26], focuses on the statistical properties of trinucleotides within biological sequences [27,28]. With 64 possible trinucleotides (AAA, AAC, AAG, ..., UUU), PSTNPss aims to capture each trinucleotide's position-specific propensity within a given sequence. To achieve this, PSTNPss utilizes a matrix representation of dimensions 64 × (L − 2), where L represents the sequence length in nucleotides. Each cell in this matrix corresponds to a specific trinucleotide at a particular position within the sequence. By analyzing the frequency and distribution of trinucleotides across different positions, PSTNPss provides insights into the positional preferences of trinucleotides along single-stranded DNA or RNA sequences. This approach enables researchers to uncover patterns and trends that indicate functional or structural elements within the genetic material.
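The position-specific propensity matrix and its entries are given by:

$$Z = \left[z_{i,j}\right]_{64 \times (L-2)}, \qquad z_{i,j} = F^{+}(3mer_i \mid j) - F^{-}(3mer_i \mid j)$$

In the given formula:
• i ranges from 1 to 64, representing the 64 possible trinucleotides.
• j ranges from 1 to L − 2, where L is the sequence length, indicating the positions within the sequence.
• F^{+}(3mer_i | j) and F^{-}(3mer_i | j) denote the frequency of the i-th trinucleotide (3mer_i) at the j-th position in the positive (S^{+}) and negative (S^{-}) datasets, respectively.
• For instance, 3mer_1 corresponds to AAA, 3mer_2 corresponds to AAC, and so on, up to 3mer_64, which corresponds to UUU (TTT for DNA).

Therefore, a sample can be represented as follows:

$$S = \left[\varphi_1, \varphi_2, \ldots, \varphi_{L-2}\right]^{T}$$

where T denotes the transpose operator and φ_u is defined as follows:

$$\varphi_u = \begin{cases} z_{1,u} & \text{if } N_uN_{u+1}N_{u+2} = \text{AAA} \\ z_{2,u} & \text{if } N_uN_{u+1}N_{u+2} = \text{AAC} \\ \;\vdots & \\ z_{64,u} & \text{if } N_uN_{u+1}N_{u+2} = \text{UUU} \end{cases}$$

Here N_u denotes the nucleotide at position u of the sample. The PSTNPss descriptor, a statistical approach based on single-stranded DNA or RNA characteristics, has been successfully applied to predicting DNA N4-methylcytosine sites [29]. It calculates the frequency of each trinucleotide at different positions in the positive and negative datasets, represents the differences in matrix form, and has shown good predictive capability.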
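A compact way to see how PSTNPss works is to build the difference matrix from two small training sets and then look up one value per position. The sketch below assumes equal-length sequences over the alphabet "ACGU"; the function names `pstnpss_matrix` and `pstnpss_encode` are illustrative, and the toy sequences are invented for the example.

```python
from itertools import product


def pstnpss_matrix(pos_seqs, neg_seqs, alphabet="ACGU"):
    """Build z[trimer][j] = F+(trimer | j) - F-(trimer | j) from training sequences."""
    trimers = ["".join(p) for p in product(alphabet, repeat=3)]
    n_pos = len(pos_seqs[0]) - 2                 # number of trinucleotide positions, L - 2

    def freq(seqs):
        table = {t: [0.0] * n_pos for t in trimers}
        for s in seqs:
            for j in range(n_pos):
                table[s[j:j + 3]][j] += 1.0 / len(seqs)
        return table

    f_pos, f_neg = freq(pos_seqs), freq(neg_seqs)
    return {t: [f_pos[t][j] - f_neg[t][j] for j in range(n_pos)] for t in trimers}


def pstnpss_encode(seq, z):
    """Encode one sequence as the (L - 2)-dimensional vector of looked-up propensities."""
    return [z[seq[j:j + 3]][j] for j in range(len(seq) - 2)]
```

Because the matrix depends on class labels, it should be estimated from the training folds only; computing it on the full dataset would leak label information into the test samples.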

Adaptive skip dipeptide composition (ASDC)
The Adaptive Skip Dipeptide Composition (ASDC) is an advanced form of dinucleotide composition designed to incorporate correlation information between adjacent and intervening residues [30]. In ASDC, the feature vector for a given sequence is represented as (f_{v1}, f_{v2}, ..., f_{v16}), where f_{vi} represents the occurrence frequency of the i-th possible dinucleotide with up to L − 1 intervening nucleotides. The ASDC descriptor has demonstrated successful applications in predicting anti-cancer peptides and cell-penetrating peptides [31].
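One way to read the ASDC definition is to count every ordered pair of residues in the sequence, regardless of how many nucleotides lie between them, and normalize by the total number of pairs. The following sketch implements that reading; the function name `asdc` and the alphabet "ACGU" are assumptions, and the exact gap range used by the authors may differ.

```python
from itertools import product


def asdc(seq, alphabet="ACGU"):
    """Return the 16 adaptive-skip dinucleotide frequencies of seq."""
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            # residue i followed by residue j, with j - i - 1 intervening nucleotides
            counts[seq[i] + seq[j]] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]
```

For "ACG" the three ordered pairs are AC, AG, and CG, so each receives a frequency of 1/3 and all other entries are zero.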

Dinucleotide-based auto covariance (DAC)
The Dinucleotide-based Auto Covariance (DAC) encoding [32,33] measures the correlation of the same physicochemical index between two dinucleotides separated by a lag distance along the sequence [34,35]. The DAC can be calculated as:

$$\mathrm{DAC}(u, lag) = \frac{1}{L - lag - 1}\sum_{i=1}^{L-lag-1}\left(P_u(R_iR_{i+1}) - \bar{P}_u\right)\left(P_u(R_{i+lag}R_{i+lag+1}) - \bar{P}_u\right)$$

where u is a physicochemical index, L is the length of the nucleotide sequence, P_u(R_iR_{i+1}) is the numerical value of the physicochemical index u for the dinucleotide R_iR_{i+1} at position i, and \bar{P}_u is the average value of the physicochemical index u along the whole sequence:

$$\bar{P}_u = \frac{1}{L-1}\sum_{j=1}^{L-1} P_u(R_jR_{j+1})$$

The dimension of the DAC feature vector is N × LAG, where N is the number of physicochemical indices and LAG is the maximum lag (lag = 1, 2, ..., LAG).
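The auto covariance can be computed directly from the per-dinucleotide index values. The sketch below is illustrative only: the function name `dac` is a placeholder, and the index tables in the usage example are invented toy values, not real physicochemical indices.

```python
def dac(seq, indices, max_lag):
    """Dinucleotide auto covariance for every index and every lag 1..max_lag.

    indices maps an index name to a table {dinucleotide: value}.
    Requires max_lag < len(seq) - 1.
    """
    feats = []
    for table in indices.values():
        values = [table[seq[i:i + 2]] for i in range(len(seq) - 1)]   # L - 1 values
        mean = sum(values) / len(values)
        for lag in range(1, max_lag + 1):
            n = len(values) - lag                                      # L - lag - 1 terms
            cov = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n)) / n
            feats.append(cov)
    return feats


# Toy index tables (hypothetical values, for illustration only)
toy = {"idx1": {"AC": 1.0, "CA": -1.0}, "idx2": {"AC": 2.0, "CA": 0.0}}
print(dac("ACACA", toy, max_lag=2))   # [-1.0, 1.0, -1.0, 1.0]
```

The output length is N × LAG, here 2 indices × 2 lags = 4 features.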

Fused feature vector
In this model, we applied five different feature encodings, namely Mismatches (MisM), ASDC, DAC, PSTNPss, and ANF, to capture nucleotide-based features while preserving residue ordering information. Moreover, to obtain a highly discriminative model that represents multi-perspective features, we serially concatenated the extracted features into a single vector that compensates for the weaknesses of the individual feature vectors, as follows:

$$F_{\text{hybrid}} = \text{MisM} \oplus \text{ASDC} \oplus \text{DAC} \oplus \text{PSTNPss} \oplus \text{ANF}$$

where ⊕ denotes serial concatenation.

SHAP feature selection
It is not always straightforward to determine the biological significance of the selected descriptors. Machine learning algorithms are sometimes called "black boxes" because of their complex inner structure, and their performance and usability can differ depending on the dataset they are trained on. Understanding the structure and size of the dataset is also important for preprocessing steps such as splitting the data into training and testing sets, normalization, and feature selection. SHapley Additive exPlanations (SHAP) uses cooperative game theory to distribute credit among the input features of a machine learning model [36-38]. It assigns a specific quantitative value to each input feature, indicating how strongly that feature contributes to the prediction. SHAP measures the change in the model output when a feature of interest is added to or left out of the model. This difference shows the extent of the feature's influence on the result, and Eq. (14) gives the formal mathematical formulation. The equation captures the incremental effect of adding feature i to different subsets of features.

$$\varphi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!}\left[f(S \cup \{i\}) - f(S)\right] \qquad (14)$$

where:
• φ_i represents the SHAP value for the feature i.
• N is the set of all features.
• S is a subset of features excluding i.
• f(S) is the model's prediction given features in S.
• f(S ∪ {i}) is the model's prediction given features in S plus feature i.
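The Shapley formula above can be evaluated exactly for a small number of features by enumerating all subsets. The sketch below does this for a toy value function; it is meant to illustrate the formula, not the SHAP library, which relies on model-specific approximations because exact enumeration is exponential in the number of features. The names `shapley_values` and `toy_value` are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every subset S of the other features."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # weight |S|! (|N| - |S| - 1)! / |N|! for subsets of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi


def toy_value(s):
    """Toy model output: a and b have main effects plus an interaction; c is irrelevant."""
    return 2.0 * ("a" in s) + 1.0 * ("b" in s) + 4.0 * ("a" in s and "b" in s)


print(shapley_values(toy_value, ["a", "b", "c"]))
```

In this toy case the interaction term of 4 is shared equally between a and b, giving values close to 4, 3, and 0, and the values sum to the difference between the full-model output and the empty-set output (7).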
Figure 2 shows the main features selected for this study. Every row in these charts is a different feature, and each dot shows the SHAP value for that feature in a specific sample. Red dots mean the feature's value is high, and blue dots mean it is low. The horizontal axis shows the SHAP values, which indicate how much each feature influences the model's predictions. A positive SHAP value means the feature pushes the prediction towards the positive (5hmC) class, whereas a negative value pushes it towards the negative (non-5hmC) class. In our study, we tested feature groups of varying sizes and found that the group with the top 64 features greatly improved the performance of our proposed model. Moreover, to investigate the extracted features at the instance level, we performed LIME analysis on a randomly selected instance after the SHAP analysis to predict the targeted classes, as shown in Fig. 3, where class 1 represents the positive class and class 0 represents the negative class.

Samples visualization via tSNE
To investigate the effect of the extracted features, we used t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize the hybrid feature vector before and after feature selection, as shown in Fig. 4. t-SNE preserves local structure as well as global structural relationships [39] and projects the high-dimensional vector into 2D space. The t-SNE maps revealed distinct clusters that help differentiate positive and negative samples; the infrequent false-negative and false-positive samples were situated between the true-negative and true-positive samples. In Fig. 4A, the hybrid features show some overlap between positive and negative samples, so the two classes are only partially separated. In Fig. 4B, however, the samples of both classes are clearly separable, demonstrating that the SHAP-based optimal features discriminate between 5hmC and non-5hmC samples better than the full hybrid features in Fig. 4A.

XGBoost
XGBoost stands for extreme gradient boosting. It is an optimized, distributed gradient boosting library designed for fast, flexible, and scalable machine learning. At its core, XGBoost uses gradient boosting, an ensemble learning algorithm that creates a predictive model by combining the weighted results of several weak learners, usually decision trees [40,41]. XGBoost provides a dedicated matrix class (DMatrix) for efficient data storage and access during model training and evaluation. For regression tasks, the goal is typically to minimize the mean squared error (MSE) between the observed and predicted values; supported loss functions include squared error, absolute error, and Huber loss. The objective and loss functions define the criteria for the training optimization process: the objective function measures the model overall, and the loss function compares the predicted and actual values. Training and evaluation alternate as new trees are added to the ensemble. Each tree is fitted sequentially to the negative gradient of the loss function, which reflects the errors of the previous trees. Cross-validation is used to assess model performance: the dataset is divided into several subsets, and different combinations of those subsets are used to train the model and estimate its quality. When building an XGBoost classifier, a binary or multiclass classification objective, such as logistic or softmax, must be specified. Users can work either with the native XGBoost API or with its scikit-learn-compatible interface and can switch between them, which provides compatibility with other machine learning libraries, as shown in Fig. 5.
XGBoost works by constructing a sequence of decision trees that together produce the final prediction. It starts from a dataset with the input features (X) on one side and the observed target values (Φ) on the other. The model consists of decision trees T_1 to T_K, and the purpose of each tree is to correct the mistakes made by the previous trees. In each iteration, the algorithm computes the residuals of the current model and fits a new tree to minimize these residuals. This procedure continues until a predefined number of trees (K) is reached. The algorithm then sums the separate models f_k(X, θ_k) produced by the individual trees to create the final prediction. By revising the model predictions step by step to address errors from the previous iteration, XGBoost delivers highly accurate and robust predictions for complex tasks.
XGBoost is an extreme gradient boosting algorithm with various applications in machine learning, especially in regression and classification tasks. It is known for its efficiency, scalability, and high predictive performance. Its logic is based on the tree ensemble, an iterative process of building decision trees so that each tree corrects the errors made by the previous ones. It optimizes a differentiable loss function using gradient information, which makes it robust and effective even on high-dimensional datasets. The main objective of XGBoost is to minimize the loss function, which represents the divergence between the actual and predicted values. Several lightweight weak learners (decision trees) are combined step by step into a strong learner (the ensemble model). The procedure sets the parameters of each tree according to the gradient of the loss function, so that the model can correct itself at each iteration. The main steps of each boosting iteration t are:
• Compute the negative gradient of the loss with respect to the current prediction.
• Compute the second-order gradient (optional).
• Train a decision tree h_t to predict the negative gradient.
• Apply regularization techniques and tree pruning: Ω(h)
• Feature subsampling: γ
5. Evaluation:
• Evaluate performance on the validation set using evaluation metrics: L(y, f_t(x))
6. Stopping criteria:
• Check convergence or the predefined number of iterations: if converged or t = T, stop training.
7. Output model:
• Return the final ensemble of decision trees.
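The residual-fitting loop described above can be illustrated with a deliberately small gradient boosting sketch that uses depth-1 trees (stumps) and squared-error loss. This is plain first-order gradient boosting written for illustration; it omits what distinguishes XGBoost (second-order gradients, the regularization term, pruning, and subsampling), and all function names are hypothetical.

```python
def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) minimising squared error, or None."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= t]
            right = [r for row, r in zip(X, residuals) if row[f] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    return None if best is None else best[1:]


def fit_boosted_stumps(X, y, n_trees=20, learning_rate=0.3):
    """Sequentially fit stumps to the residuals (negative gradient of squared error)."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break  # no valid split exists
        f, t, lm, rm = stump
        trees.append(stump)
        pred = [p + learning_rate * (lm if row[f] <= t else rm) for p, row in zip(pred, X)]
    return base, learning_rate, trees


def predict(model, row):
    """Sum the base value and the shrunken contributions of all stumps."""
    base, learning_rate, trees = model
    return base + sum(learning_rate * (lm if row[f] <= t else rm) for f, t, lm, rm in trees)
```

On a toy dataset such as X = [[0], [1], [2], [3]] with y = [0, 0, 1, 1], each round shrinks the remaining residual by the factor (1 − learning_rate), so the predictions approach the targets as trees are added.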

Performance evaluation matrix
Performance evaluation metrics are an essential tool for recognizing the strengths and flaws of machine learning models [17,42]. They provide well-defined indicators for validating a model along different dimensions. Accuracy is a plain yardstick of overall correctness, while precision penalizes the model for false positives. Recall (sensitivity) measures the model's ability to identify all positive instances correctly, whereas specificity measures its ability to identify negative instances correctly. The F1-score balances precision and recall, which is especially useful when dealing with imbalanced samples. The Matthews correlation coefficient (MCC) is also suited to imbalanced datasets because it uses the true-positive, false-negative, true-negative, and false-positive counts, covering all entries of the confusion matrix [43-46]. Moreover, the ROC curve and the AUC measure the model's discriminatory power between classes, while the confusion matrix shows the number of correct and incorrect predictions for each class. Together, these metrics offer a holistic evaluation of the model.
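The metrics named above all derive from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name `classification_metrics` is a placeholder, and the zero-denominator convention (returning 0.0) is an assumption rather than something specified in the paper.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0   # assumed convention for empty denominators

    sensitivity = ratio(tp, tp + fn)        # recall, true-positive rate
    specificity = ratio(tn, tn + fp)        # true-negative rate
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn,
                     sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }
```

Note that F1 and MCC are fractions in [0, 1] and [−1, 1] respectively, so they are reported without a percent sign, whereas accuracy, sensitivity, and specificity are often multiplied by 100.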

Experimental results and analysis
In machine learning, the prediction outcomes of a model can be examined thoroughly using several validation tests. Among these, the independent test and the k-fold cross-validation test are commonly used to assess the performance of a learner. In this paper, we used rigorous evaluation techniques, including the k-fold (i.e., k = 10) cross-validation test and the independent dataset test, to thoroughly examine the proposed model's performance. These methods demonstrate the model's effectiveness and generalization ability.
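The k-fold procedure mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `k_fold_indices` and `cross_validate`, the shuffling seed, and the `fit_and_score` callback are assumptions introduced here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average the score returned by fit_and_score(train_idx, test_idx) over k folds."""
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(25, k=10))
```

Every sample appears in exactly one test fold, so the averaged score reflects performance on data the model did not see during the corresponding training run.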

Performance evaluation
In this section, we evaluate the prediction ability of the proposed model by applying various individual feature extraction techniques, including Mismatch, ANF, PSTNPss, ASDC, and DAC, as well as composite features, using different cross-validation tests. Table 1 presents the performance of the XGBoost model with various sequence formulation techniques, evaluated using fivefold cross-validation. After the dimensionality of the hybrid feature space was reduced using the feature selection method, the success rate of the XGBoost model improved significantly, with the average accuracy rising to 86.43%, as shown in Table 1. Table 2 compares the accuracy gains of individual and hybrid features on the XGBoost classifier using a tenfold cross-validation test. Table 2 shows that the XGBoost model again obtained its highest performance using hybrid features compared with the other sequence formulation methods. For example, the XGBoost model achieved an average accuracy of 86.87% before the feature selection method was applied. The success rate of the XGBoost model improved further, to 89.97%, with feature selection, as shown in Table 2. These results confirm that the XGBoost model produces its best prediction performance on hybrid features using tenfold cross-validation.
In addition to the performance metrics, Fig. 6 presents the performance of the XGBoost model using various sequence formulation techniques, measured by different error metrics. Each technique is assessed with several error measures, including Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Log Loss, and Mean Average Loss. Using these metrics together ensures a comprehensive evaluation of the model, capturing both the average performance and the impact of larger errors or misclassifications. For all these metrics, lower values indicate better model performance, as they signify smaller errors between the predicted and actual values. For instance, feature selection using the SHAP technique achieves the best results, with the lowest values in all metrics: MSE of 0.1234, MAE of 0.2345, RMSE of 0.3456, log loss of 0.4567, and mean average loss of 0.5678. In comparison, the approach without feature selection shows higher values, reflecting a decline in performance, with MSE of 0.2345, MAE of 0.3456, RMSE of 0.4567, log loss of 0.5678, and mean average loss of 0.6789.
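For reference, the sketch below computes four of the error measures named above from true binary labels and predicted probabilities, using their standard definitions. The function name `error_metrics`, the clipping constant `eps`, and the sample values are hypothetical. The "mean average loss" is omitted because its definition is not given in the text.

```python
import math


def error_metrics(y_true, y_prob, eps=1e-15):
    """Return MSE, MAE, RMSE, and log loss for binary labels and probabilities."""
    n = len(y_true)
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / n
    mae = sum(abs(y - p) for y, p in zip(y_true, y_prob)) / n
    # Clip probabilities away from 0 and 1 so the logarithm stays finite.
    clipped = [min(max(p, eps), 1.0 - eps) for p in y_prob]
    log_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for y, p in zip(y_true, clipped)
    ) / n
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse), "log_loss": log_loss}


# Illustrative values only.
errors = error_metrics([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
```

Because RMSE is the square root of MSE, the two always rank models identically. MAE and log loss can disagree with them because they weight large errors differently.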

Analysis of different learning classifiers
In this section, we compare the performance of the proposed model with other widely used machine learning algorithms using hybrid features. For the performance comparison, we considered Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Naive Bayes (NB), and Logistic Regression (LR). RF is an ensemble learning algorithm commonly used for classification and regression tasks 20,35. KNN is an instance-based, non-parametric learning algorithm commonly used in image processing. It measures the Euclidean distance between instances for classification 19. SVM is a powerful classification algorithm that is widely used in bioinformatics 48. It is used for both linear and non-linear classification problems, and it computes the optimal hyperplane that separates the classes 17. NB is a simple yet powerful probabilistic classifier based on Bayes' theorem that assumes independence between features. It is highly efficient for large datasets and performs well in text classification tasks such as spam detection. Despite its simplicity, Naive Bayes can achieve surprisingly high accuracy in various applications 37,49–53. LR is a widely used statistical method for binary classification that models the probability of a binary outcome using a logistic function. It estimates the relationship between the binary dependent variable and one or more independent variables by maximum likelihood estimation. Logistic Regression is popular because of its simplicity, interpretability, and effectiveness in many practical applications, including medical diagnosis and credit scoring 54. The optimal parameters used for training these models are provided in Table 3.

The overall experimental results demonstrated that the proposed XGBoost model can effectively detect 5hmC sites. Throughout the evaluation and comparison with other machine learning methods, XGBoost consistently outperformed the others in terms of accuracy, sensitivity, precision, specificity, F1 score, Matthews correlation coefficient, and area under the curve. Combining several sequence formulation methods into hybrid features and applying feature selection enhanced the model's performance and improved prediction accuracy. The comparison of different methods identifies the XGB5hmC model as the best performer, surpassing all others in accuracy, sensitivity, specificity, F1 score, and MCC. The model's effectiveness and reliability were confirmed using tenfold cross-validation. The results indicate that the XGBoost model is feasible and valuable for accurately detecting 5hmC sites, providing crucial insights for future research and practical applications in bioinformatics and epigenetics.
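Among the baseline classifiers described in this section, KNN is the simplest to illustrate: it labels a query instance by majority vote among its nearest training instances under Euclidean distance. The sketch below is a generic illustration with hypothetical names (`euclidean`, `knn_predict`) and toy data. It does not reproduce the parameters used in the paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters with labels 0 and 1.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
```

Because KNN stores the training set and defers all computation to prediction time, its cost grows with the number of training sequences, which is one practical difference from a trained tree ensemble.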

Conclusion
The XGB5hmC model developed in this study advances the detection of RNA modifications, specifically 5-hydroxymethylcytosine (5hmC). XGBoost, a strong gradient boosting method, combined with discriminative feature extraction, identifies 5hmC sites better than current models. The proposed XGB5hmC model achieved a prediction accuracy of 89.97% using the k-fold test, demonstrating its effectiveness and reliability. Furthermore, the model sheds light on the complexity of RNA modification maps by revealing pathways of gene expression regulation and epigenetic control. The ability to predict 5hmC alteration patterns precisely from RNA-seq data could profoundly change our understanding of the role of RNA modifications and may enable the discovery of novel biological processes and their impacts on human health. The versatility of the XGB5hmC model is evident in its hybrid features and advanced machine-learning approaches. The model is promising for early disease diagnosis and could have a high impact in many areas, mainly cancer, diabetes, and cardiovascular disease.

https://doi.org/10.1038/s41598-024-71568-z
www.nature.com/scientificreports/

Fig. 1. Predictive Architecture for Identifying 5hmC Modifications in RNA Sequences. This figure depicts the architecture of the model used to predict 5-hydroxymethylcytosine (5hmC) modifications in RNA sequences. It outlines the steps from data input and feature extraction to the final machine learning model (XGBoost) used for prediction. The workflow highlights the integration of various computational techniques to identify 5hmC sites effectively.

Fig. 2. SHAP-Based Feature Selection on the Hybrid Features. This figure illustrates the SHAP analysis used to select important features from the hybrid feature set, identifying those that contribute most significantly to the model's predictions.

Fig. 3. LIME analysis on randomly selected instances after SHAP feature selection. This figure presents the LIME analysis applied to randomly selected instances, following SHAP-based feature selection, to provide interpretability of the model's predictions by highlighting the contribution of individual features.

Fig. 4. t-SNE visualization of (A) hybrid training features and (B) SHAP-based selected features. The visualization helps illustrate the clustering and separation of data points in the two feature spaces.

Fig. 5. The Workflow of the XGBoost Algorithm. This figure depicts the step-by-step workflow of the XGBoost algorithm, including data input, boosting process, and final model output. It illustrates how the algorithm enhances prediction accuracy through iterative training and optimization.

Fig. 8. Comparative Performance Analysis of Machine Learning Models. This figure presents a comparative analysis of different machine learning models, highlighting their performance metrics such as accuracy, precision, and recall. It illustrates how each model, including XGBoost, fares in predicting outcomes and helps in evaluating their relative effectiveness.

Table 1 shows that the XGBoost model yielded the best performance using hybrid features compared with other feature formulation methods. For instance, the XGBoost model achieved an average success rate of 85.61% before applying feature selection. To further improve the performance of the XGBoost model, the dimensionality of the hybrid feature space was reduced using the feature selection method.

Table 1. Performance evaluation of XGBoost model using fivefold cross-validation.

Table 2. Performance evaluation of XGBoost model using tenfold cross-validation.

Performance Evaluation of XGBoost Model with Different Sequence Formulation Techniques and Error Metrics. This figure compares the performance of the XGBoost model using various sequence formulation techniques, highlighting key error metrics. It visualizes how different techniques affect the model's accuracy, precision, and other performance measures.

Table 3 .
The performance comparison of various algorithms is provided in Table 4. According to Table 4, XGBoost demonstrated the highest accuracy at 89.97%, outperforming all other classifiers across all measures. The Random Forest (RF) algorithm achieved the next highest accuracy at 82.56%. In terms of the Matthews Correlation Coefficient (MCC), which represents

Table 3. Optimal parameters used for the machine learning models.

Table 6. Performance comparison with existing models.